Goto

Collaborating Authors

 sup 0


Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

arXiv.org Machine Learning

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.


High-dimensional Limit of SGD for Diagonal Linear Networks

arXiv.org Machine Learning

Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.


Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

arXiv.org Machine Learning

The transformer architecture [52], which underlies present-day Large Language Models, has been one of the main drivers of recent advances in machine learning and artificial intelligence. At each layer, the hidden state of the network is updated by sequentially applying two distinct operations: attention modules [3], which capture long-range interactions in the input sequence, and classical MultiLayer Perceptrons (MLPs), acting separately on each element of that sequence. Despite their empirical success, the mechanisms governing information propagation through depth, and the way attention and MLP blocks jointly shape internal representations, remain only partially understood from a theoretical viewpoint. Recent progress has come from viewing transformers in suitable scaling limits as deterministic mean-field interacting particle systems modeling the evolution of N tokens1 through the layers of the neural network architecture (the so-called residual stream dynamics), see, among others, [46, 26, 27, 45]. In these descriptions, depth plays the role of a continuous time variable, and, in the large-context regime (N), the evolution of token representations is encoded by a PDE for their empirical distribution. This viewpoint is closely connected to the literature on scaling laws, where the effect of various scaling exponents controlling the relative size of the network's hyperparameters (e.g., depth, width, context length) on the effective dynamics of the model




equaltoz = z 1tonormalize andhea Student ' - t-distribp(z) = 8

Neural Information Processing Systems

Let w =( 1.5,0,..0) N(0,0.5) Denoting (25) utionofonwsameasthatof (26) eyobservationisthat.., Z1/2w k are Toseewhythisisthecase, wecanvectorizeeachterm: First, let' Lemma ForanyF :Rd R!R+, define problem 1,..., k, as : = su Next, let' 2021) Provingthe 31 Lf (w, b) C(w)2 n (49) tobetheleft(47)(wherethe ( (w),b)isused depends wonlythrough (w)).


A Organization of the Appendices

Neural Information Processing Systems

In the Appendix, we give proofs of all results from the main text. We say a function f: R Y! R is M -Lipschitz if for any y 2Y and ˆ y We can also define the Moreau envelope of a function f: R Y! R by The proof of all results in this section can be straightforwardly extended to these settings. Boyd et al. 2004; Bauschke, Combettes, et al. 2011; Rockafellar 1970), but is also useful and Interestingly, there is a similar equivalent characterization for Lipschitz functions as well. Finally, we show that any smooth loss is square-root-Lipschitz. Lipschitz losses is more general than the class of smooth losses studied in Srebro et al. 2010 .



Pathway to $O(\sqrt{d})$ Complexity bound under Wasserstein metric of flow-based models

arXiv.org Artificial Intelligence

We provide attainable analytical tools to estimate the error of flow-based generative models under the Wasserstein metric and to establish the optimal sampling iteration complexity bound with respect to dimension as $O(\sqrt{d})$. We show this error can be explicitly controlled by two parts: the Lipschitzness of the push-forward maps of the backward flow which scales independently of the dimension; and a local discretization error scales $O(\sqrt{d})$ in terms of dimension. The former one is related to the existence of Lipschitz changes of variables induced by the (heat) flow. The latter one consists of the regularity of the score function in both spatial and temporal directions. These assumptions are valid in the flow-based generative model associated with the Föllmer process and $1$-rectified flow under the Gaussian tail assumption. As a consequence, we show that the sampling iteration complexity grows linearly with the square root of the trace of the covariance operator, which is related to the invariant distribution of the forward process.


Deep Neural Operator Learning for Probabilistic Models

arXiv.org Artificial Intelligence

We propose a deep neural-operator framework for a general class of probability models. Under global Lipschitz conditions on the operator over the entire Euclidean space-and for a broad class of probabilistic models-we establish a universal approximation theorem with explicit network-size bounds for the proposed architecture. The underlying stochastic processes are required only to satisfy integrability and general tail-probability conditions. We verify these assumptions for both European and American option-pricing problems within the forward-backward SDE (FBSDE) framework, which in turn covers a broad class of operators arising from parabolic PDEs, with or without free boundaries. Finally, we present a numerical example for a basket of American options, demonstrating that the learned model produces optimal stopping boundaries for new strike prices without retraining.